Nathaniel Dake Blog

1. Time Series Forecasting

If you have been following along with my posts, you may have realized that something I hadn't spent much time dealing with was time series and the subsequent forecasting. I have dealt with sequences (both via Recurrent Neural Networks and Markov Models), but given the vast amount of time series data that you can encounter in industry, this post is long overdue.

Before digging into the theory, I think it may be most beneficial to start from the top and work our way down. Namely, I'd like to get everyone exposed to the actual interaction with time series data through the pandas library, and then we will move on to specific forecasting techniques and dive into the mathematics. The reason for this is that, in my experience, getting time series data into the correct format and manipulating it as needed is a bit more challenging than traditional ML preprocessing (especially if you have never worked with it before). With that said, let's get started!

1. Datetimes in Python, Numpy and Pandas

Built right in to native python is the ability to create a datetime object:

In [175]:
from datetime import datetime

# Date information
year = 2020
month = 1
day = 2

# Time information
hour = 13
mins = 30
sec = 15

date = datetime(year, month, day, hour, mins, sec)
print(date)
2020-01-02 13:30:15
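These datetime objects expose their components as attributes and support arithmetic via timedelta. A minimal sketch (the specific dates are just illustrative):

```python
from datetime import datetime, timedelta

date = datetime(2020, 1, 2, 13, 30, 15)

# Components are available as attributes
print(date.year, date.month, date.day)  # 2020 1 2

# Adding a timedelta shifts the datetime forward
next_week = date + timedelta(days=7)
print(next_week)  # 2020-01-09 13:30:15
```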

Now, while Python does have the built-in ability to handle datetimes, numpy is more efficient when it comes to handling dates. The numpy data type for datetimes is datetime64, and it will have a different type compared to Python's built-in datetime:

In [176]:
import numpy as np

np_date = np.array(["2020-03-15"], dtype="datetime64")


print(f"Pythons datetime type: {type(date)}\n", "Numpy datetime type: ", np_date.dtype)
Pythons datetime type: <class 'datetime.datetime'>
 Numpy datetime type:  datetime64[D]

We can take this one step further and actually create numpy date ranges. By using np.arange() we can create a date range easily as follows:

In [177]:
display(np.arange("2018-06-01", "2018-06-23", 7, dtype="datetime64[D]")) # The [D] unit means the step of 7 is in days
array(['2018-06-01', '2018-06-08', '2018-06-15', '2018-06-22'],
      dtype='datetime64[D]')
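These datetime64 arrays also support vectorized arithmetic through numpy's companion timedelta64 type. A small sketch using the range above:

```python
import numpy as np

dates = np.arange("2018-06-01", "2018-06-23", 7, dtype="datetime64[D]")

# Subtracting two datetime64 values yields a timedelta64
span = dates[-1] - dates[0]
print(span)  # 21 days

# Adding a timedelta64 shifts every element at once
shifted = dates + np.timedelta64(1, "D")
print(shifted[0])  # 2018-06-02
```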

Pandas handles datetimes in a way that is built on top of numpy. The API to create a date range is shown below:

In [178]:
import pandas as pd

display(pd.date_range("2020-01-01", periods=7, freq="D")) # String code D stands for days
DatetimeIndex(['2020-01-01', '2020-01-02', '2020-01-03', '2020-01-04',
               '2020-01-05', '2020-01-06', '2020-01-07'],
              dtype='datetime64[ns]', freq='D')

The return value is a pandas DatetimeIndex, a specialized pandas index built for datetimes. In other words, the entries are not just strings; pandas is aware that they are datetime64 objects. Pandas is able to handle a variety of string codes, however, we will stick with the standard year-month-day.
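Note that "D" is only one of many offset aliases pandas accepts. For instance (a small sketch: "W" anchors to the week end, i.e. Sundays, and "B" skips weekends):

```python
import pandas as pd

weekly = pd.date_range("2020-01-01", periods=3, freq="W")    # Sundays
business = pd.date_range("2020-01-01", periods=3, freq="B")  # business days

print(weekly.tolist())
print(business.tolist())
```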

A nice helper method that pandas offers is the to_datetime method:

In [179]:
display(pd.to_datetime(["1/2/2018"]))
DatetimeIndex(['2018-01-02'], dtype='datetime64[ns]', freq=None)

Which again returns a DatetimeIndex. An interesting thing to note is that if we don't pass in a list above, we receive a Timestamp object instead:

In [180]:
display(pd.to_datetime("1/2/2018"))
Timestamp('2018-01-02 00:00:00')
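A Timestamp behaves much like Python's datetime, with some extra conveniences; for example:

```python
import pandas as pd

ts = pd.to_datetime("1/2/2018")

# The usual datetime attributes are available
print(ts.year, ts.month, ts.day)  # 2018 1 2

# Plus pandas-specific helpers
print(ts.day_name())  # Tuesday
```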

Another common bit of preprocessing that you will most certainly come across is receiving datetimes in a format other than the expected default. For instance, imagine that you are being sent a series of datetimes from an external database and the format is:

"2018--1--2"

Well, thankfully pandas to_datetime has a format keyword argument that can be used as follows:

In [181]:
display(pd.to_datetime(["2018--1--2"], format="%Y--%m--%d"))
display(pd.to_datetime(["2018/-1/-2"], format="%Y/-%m/-%d"))
DatetimeIndex(['2018-01-02'], dtype='datetime64[ns]', freq=None)
DatetimeIndex(['2018-01-02'], dtype='datetime64[ns]', freq=None)
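If some of the incoming strings might not match the expected format, to_datetime also accepts an errors keyword; with errors="coerce", unparseable entries become NaT rather than raising. A quick sketch:

```python
import pandas as pd

# Unparseable entries become NaT (not-a-time) instead of raising an error
parsed = pd.to_datetime(["2018--1--2", "not a date"],
                        format="%Y--%m--%d", errors="coerce")
print(parsed)
```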

Finally, we can create a pandas dataframe with a date time index:

In [182]:
idx = pd.date_range("2020-01-01", periods=3, freq="D")
cols = ["A", "B"]
df = pd.DataFrame(np.random.randn(3, 2), index=idx, columns=cols)

display(df)
A B
2020-01-01 -0.029205 0.475634
2020-01-02 0.011961 -0.855672
2020-01-03 -0.427451 -1.033305

And we can see that our index is indeed comprised of datetime objects:

In [183]:
display(df.index)
DatetimeIndex(['2020-01-01', '2020-01-02', '2020-01-03'], dtype='datetime64[ns]', freq='D')

And also perform operations such as:

In [184]:
display(df.index.max())
Timestamp('2020-01-03 00:00:00')

1.1 Time Resampling

Now that we have an idea of how to deal with basic time series object creation in native Python, numpy, and pandas, we can start performing more specific time series manipulations. To start, we can look at time resampling. This operates similarly to a groupby operation, except we end up aggregating based on some sort of time frequency.

For example, we could take daily data and resample it into monthly data (by taking the average, the sum, or some other sort of aggregation). Let's look into this further with a real data set: a CSV of Starbucks stock prices by date. We will read in our CSV with a date index that is a datetime and not a string:

In [185]:
import boto3 
s3 = boto3.client('s3')

bucket = "intuitiveml-data-sets"
key = "starbucks.csv"

obj = s3.get_object(Bucket=bucket, Key=key)
df = pd.read_csv(obj['Body'], index_col="Date", parse_dates=True)

display(df.head())
Close Volume
Date
2015-01-02 38.0061 6906098
2015-01-05 37.2781 11623796
2015-01-06 36.9748 7664340
2015-01-07 37.8848 9732554
2015-01-08 38.4961 13170548

We can see that our index is in fact a datetime:

In [186]:
display(df.index)
DatetimeIndex(['2015-01-02', '2015-01-05', '2015-01-06', '2015-01-07',
               '2015-01-08', '2015-01-09', '2015-01-12', '2015-01-13',
               '2015-01-14', '2015-01-15',
               ...
               '2018-12-17', '2018-12-18', '2018-12-19', '2018-12-20',
               '2018-12-21', '2018-12-24', '2018-12-26', '2018-12-27',
               '2018-12-28', '2018-12-31'],
              dtype='datetime64[ns]', name='Date', length=1006, freq=None)

We can now perform a basic resampling as follows (the rule "A" is pandas' offset alias for year-end frequency):

In [187]:
# daily ---> yearly
df_resampled = df["Close"].resample(rule="A").mean()
display(df_resampled)
Date
2015-12-31    50.078100
2016-12-31    53.891732
2017-12-31    55.457310
2018-12-31    56.870005
Freq: A-DEC, Name: Close, dtype: float64

A very cool feature is that we can even implement our own resampling functions (if mean, max, min, etc. do not provide the necessary functionality):

In [188]:
def first_day(entry):
    if len(entry): return entry[0]
In [189]:
df_resampled = df["Close"].resample(rule="A").apply(first_day)
display(df_resampled)
Date
2015-12-31    38.0061
2016-12-31    55.0780
2017-12-31    53.1100
2018-12-31    56.3243
Freq: A-DEC, Name: Close, dtype: float64
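As it happens, this particular aggregation is also built in: resample(...).first() should match the custom function above. A self-contained check on synthetic data (the series here is made up purely for illustration):

```python
import numpy as np
import pandas as pd

# Synthetic daily series spanning two years
idx = pd.date_range("2015-01-01", periods=730, freq="D")
s = pd.Series(np.arange(730.0), index=idx)

custom = s.resample("A").apply(lambda entry: entry.iloc[0] if len(entry) else None)
builtin = s.resample("A").first()

print(custom.equals(builtin))  # True
```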

We can of course combine this resampling with some basic plotting. Below, we can see the average closing price per year:

In [190]:
import matplotlib.pyplot as plt
import cufflinks
import plotly
import plotly.graph_objs as go
from IPython.display import HTML  # needed for display(HTML(...)) below

cufflinks.go_offline()
cufflinks.set_config_file(world_readable=True, theme='pearl', offline=True)
In [191]:
trace1 = go.Bar(
    x=df_resampled.index,
    y=df_resampled.values,
    marker = dict(
        color = 'green',
    ),
)

data = [trace1]
layout = go.Layout(
    showlegend=False,
    width=500,
    height=400,
    title="Yearly Mean Closing Price for Starbucks",
    xaxis=dict(title="Date"),
    yaxis=dict(title="Mean Closing Price")
)

fig = go.Figure(data=data, layout=layout)

# plotly.offline.iplot(fig)
html_fig = plotly.io.to_html(fig, include_plotlyjs=True)
display(HTML(html_fig))

We can of course perform the same sort of resampling at a monthly frequency as well:

In [192]:
df_resampled = df["Close"].resample(rule="M").max()
display(df_resampled.head())
Date
2015-01-31    41.5575
2015-02-28    44.2853
2015-03-31    45.8614
2015-04-30    48.5616
2015-05-31    48.8289
Freq: M, Name: Close, dtype: float64
In [193]:
# df_resampled.iplot(
#     kind="bar",
#     color="green",
#     title="Monthly max Closing Price for Starbucks",
#     xTitle='Date',
#     yTitle='Mean Closing Price',
#     dimensions=(650,350)
# )

trace1 = go.Bar(
    x=df_resampled.index,
    y=df_resampled.values,
    marker = dict(
        color = 'green',
    ),
)

data = [trace1]
layout = go.Layout(
    showlegend=False,
    width=500,
    height=400,
    title="Monthly Max Closing Price for Starbucks",
    xaxis=dict(title="Date"),
    yaxis=dict(title="Max Closing Price")
)

fig = go.Figure(data=data, layout=layout)

# plotly.offline.iplot(fig)
html_fig = plotly.io.to_html(fig, include_plotlyjs=True)
display(HTML(html_fig))

1.2 Time Shifting

Sometimes when working with time series data, you may need to shift it all up or down along the time series index. Pandas has built-in methods that can easily accomplish this. Recall the head of our Starbucks df:

In [194]:
display(df.head())
Close Volume
Date
2015-01-02 38.0061 6906098
2015-01-05 37.2781 11623796
2015-01-06 36.9748 7664340
2015-01-07 37.8848 9732554
2015-01-08 38.4961 13170548

If we shift our rows by a single row we end up with the following:

In [195]:
display(df.shift(1).head())
Close Volume
Date
2015-01-02 NaN NaN
2015-01-05 38.0061 6906098.0
2015-01-06 37.2781 11623796.0
2015-01-07 36.9748 7664340.0
2015-01-08 37.8848 9732554.0

We can also shift based on frequency codes. For instance, we can shift everything forward one month:

In [196]:
display(df.shift(periods=1, freq="M").head())
Close Volume
Date
2015-01-31 38.0061 6906098
2015-01-31 37.2781 11623796
2015-01-31 36.9748 7664340
2015-01-31 37.8848 9732554
2015-01-31 38.4961 13170548
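A common use of shift is computing day-over-day changes: subtracting the shifted series from the original gives the difference between each day and the previous one. A sketch on made-up prices:

```python
import pandas as pd

idx = pd.date_range("2015-01-02", periods=5, freq="D")
close = pd.Series([38.0, 37.3, 37.0, 37.9, 38.5], index=idx)

# Today's close minus yesterday's close (first row has no predecessor -> NaN)
daily_change = close - close.shift(1)
print(daily_change)

# diff() is the built-in equivalent
print(close.diff().equals(daily_change))  # True
```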

1.3 Rolling and Expanding

Let's now take a minute to go over rolling and expanding our time series data with pandas. A common operation when working with time series data is to smooth it with a rolling mean: we divide the data into windows of time, then calculate an aggregate for each moving window. When that aggregate is the mean, we have computed a simple moving average.

Recall that our closing price data looks like:

In [197]:
trace1 = go.Scatter(
    x=df.index,
    y=df.Close.values,
    marker = dict(
        color = 'green',
    ),
)

data = [trace1]
layout = go.Layout(
    showlegend=False,
    width=650,
    height=350,
    title="Closing Price for Starbucks",
    xaxis=dict(title="Date"),
    yaxis=dict(title="Closing Price")
)

fig = go.Figure(data=data, layout=layout)

# plotly.offline.iplot(fig)
html_fig = plotly.io.to_html(fig, include_plotlyjs=True)
display(HTML(html_fig))

What we are going to do is add in a rolling mean. A rolling mean creates a window, say 7 days, and performs some sort of aggregate function on each 7-day section; in this case, the mean. We then slide that 7-day window along the series, taking the mean at each step.

In [198]:
trace1 = go.Scatter(
    x=df.index,
    y=df.Close.rolling(window=7).mean().values,
    marker = dict(
        color = 'green',
    ),
)

data = [trace1]
layout = go.Layout(
    showlegend=False,
    width=650,
    height=350,
    title="Rolling Average",
    xaxis=dict(title="Date"),
    yaxis=dict(title="Closing Price")
)

fig = go.Figure(data=data, layout=layout)

# plotly.offline.iplot(fig)
html_fig = plotly.io.to_html(fig, include_plotlyjs=True)
display(HTML(html_fig))

To be sure that this is clear, look at the first 10 rows of our dataframe:

In [199]:
df["Close"].head(10)
Out[199]:
Date
2015-01-02    38.0061
2015-01-05    37.2781
2015-01-06    36.9748
2015-01-07    37.8848
2015-01-08    38.4961
2015-01-09    37.2361
2015-01-12    37.4415
2015-01-13    37.7401
2015-01-14    37.5301
2015-01-15    37.1381
Name: Close, dtype: float64

Now, the rolling averages are found as follows:

In [200]:
for i in range(0, 4):
    print(f"Window {i+1}, the average of rows {i}:{i+7} ->", df["Close"][i:i+7].mean())
Window 1, the average of rows 0:7 -> 37.61678571428571
Window 2, the average of rows 1:8 -> 37.57878571428571
Window 3, the average of rows 2:9 -> 37.61478571428571
Window 4, the average of rows 3:10 -> 37.63811428571428

Which we can see is equivalent to the rolling average found via pandas:

In [201]:
df["Close"].rolling(
    window=7
).mean()[6:10]
Out[201]:
Date
2015-01-12    37.616786
2015-01-13    37.578786
2015-01-14    37.614786
2015-01-15    37.638114
Name: Close, dtype: float64
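Note that the first 6 entries of a 7-day rolling mean are NaN, since a full window does not exist yet. If that is undesirable, the min_periods argument allows partial windows to be averaged; a small sketch on synthetic data:

```python
import pandas as pd

s = pd.Series(range(10), dtype=float)

full = s.rolling(window=7).mean()                    # NaN until 7 observations
partial = s.rolling(window=7, min_periods=1).mean()  # averages whatever exists

print(full.head(3).tolist())     # [nan, nan, nan]
print(partial.head(3).tolist())  # [0.0, 0.5, 1.0]
```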

Let's now overlay our original closing price with the rolling average:

In [202]:
df_rolling_window = df["Close"].rolling(
    window=7
).mean()

trace1 = go.Scatter(
    x = df_rolling_window.index,
    y = df_rolling_window.values,
    mode="lines",
    marker = dict(
        size = 6,
        color = 'orange',
    ),
    name="Rolling Mean, Window = 7 days"
)

trace2 = go.Scatter(
    x = df["Close"].index,
    y = df["Close"].values,
    mode="lines",
    marker = dict(
        size = 6,
        color = 'blue',
    ),
    name="Original"
)
data = [trace2, trace1]

layout=go.Layout(
    title="Rolling Average (7 day window) vs. No transformation Starbucks Closing Price",
    width=950,
    height=500,
    xaxis=dict(title="Date"),
    yaxis=dict(title='Closing Price'),
    legend=dict(x=0.05, y=1)
)

fig = go.Figure(data=data, layout=layout)

# plotly.offline.iplot(fig)
html_fig = plotly.io.to_html(fig, include_plotlyjs=True)
display(HTML(html_fig))

We can of course increase our window size, and we will subsequently see more smoothing:

In [203]:
df_rolling_window = df["Close"].rolling(
    window=30
).mean()

trace1 = go.Scatter(
    x = df_rolling_window.index,
    y = df_rolling_window.values,
    mode="lines",
    marker = dict(
        size = 6,
        color = 'orange',
    ),
    name="Rolling Mean, Window = 30 days"
)

trace2 = go.Scatter(
    x = df["Close"].index,
    y = df["Close"].values,
    mode="lines",
    marker = dict(
        size = 6,
        color = 'blue',
    ),
    name="Original"
)
data = [trace2, trace1]

layout=go.Layout(
    title="Rolling Average (30 day window) vs. No transformation Starbucks Closing Price",
    width=950,
    height=500,
    xaxis=dict(title="Date"),
    yaxis=dict(title='Closing Price'),
    legend=dict(x=0.05, y=1)
)

fig = go.Figure(data=data, layout=layout)

# plotly.offline.iplot(fig)
html_fig = plotly.io.to_html(fig, include_plotlyjs=True)
display(HTML(html_fig))

As we continue increasing the window size, we can see a more general trend. Now, in addition to rolling windows, we can also work with expanding windows. For instance, what if we wanted to take into account everything from the start of the time series up to each point in time (i.e., a cumulative average)? This works as follows:

In [217]:
df_expanding = df["Close"].expanding().mean()

trace1 = go.Scatter(
    x=df_expanding.index,
    y=df_expanding.values,
    marker = dict(
        color = 'green',
    ),
)

data = [trace1]
layout = go.Layout(
    showlegend=False,
    width=650,
    height=350,
    title="Expanding Closing Price Average",
    xaxis=dict(title="Date"),
    yaxis=dict(title="Closing Price")
)

fig = go.Figure(data=data, layout=layout)

# plotly.offline.iplot(fig)
html_fig = plotly.io.to_html(fig, include_plotlyjs=True)
display(HTML(html_fig))

We can see that this curve is far smoother than the raw series: since each point averages everything up to that date, fluctuations are damped more and more as the window expands, and the curve lags well behind the original data.
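To make the mechanics concrete: the expanding mean at each position is just the mean of everything up to and including that point, i.e. the cumulative sum divided by the observation count. A tiny sketch:

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, 2.0, 3.0, 4.0])

expanding_mean = s.expanding().mean()

# Equivalent: cumulative sum divided by number of observations so far
manual = s.cumsum() / np.arange(1, len(s) + 1)

print(expanding_mean.tolist())  # [1.0, 1.5, 2.0, 2.5]
```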

2. Time Series Analysis

I want us to now move on to learning about the main tool we will use in time series forecasting: the Statsmodels library. Statsmodels can be thought of as follows:

A python module that provides classes and functions for the estimation of many different statistical models, as well as for conducting statistical tests and statistical data exploration.

Keep in mind that we won't really be doing any forecasting yet. Rather, we will be familiarizing ourselves with the statsmodels library and some of the statistical tests that you can perform on time series data.

2.1 Properties of Time Series Data

Let's look at some basic properties of time series data. To begin, we have trends. Time series can have trends, as seen below:

In [205]:
from plotly import tools

x = np.arange(0,50,0.01)

y_stationary = np.sin(x)
y_upward = x*0.1 + np.sin(x)
y_downward = np.sin(x) - x*0.1

trace1 = go.Scatter(
    x=x,
    y=y_stationary
)

trace2 = go.Scatter(
    x=x,
    y=y_upward,
    xaxis='x2',
    yaxis='y2'
)

trace3 = go.Scatter(
    x=x,
    y=y_downward,
    xaxis='x3',
    yaxis='y3'
)

data = [trace1, trace2, trace3]
layout = go.Layout(
    showlegend=False
)

fig = tools.make_subplots(
    rows=1,
    cols=3,
    subplot_titles=("Stationary", "Upward", "Downward"),
    print_grid=False
)

fig.append_trace(trace1, 1, 1)
fig.append_trace(trace2, 1, 2)
fig.append_trace(trace3, 1, 3)

fig['layout']['yaxis1'].update(range=[-3, 3])
fig['layout'].update(
    showlegend=False,
    height=300
)

# plotly.offline.iplot(fig)
html_fig = plotly.io.to_html(fig, include_plotlyjs=True)
display(HTML(html_fig))

Above, we can see stationary, upward, and downward trends. Most time series will exhibit one of these general behaviors. Additionally, time series can exhibit seasonality, a repeating trend:

In [206]:
s3 = boto3.client('s3')
bucket = "intuitiveml-data-sets"
key = "monthly_milk_production.csv"
obj = s3.get_object(Bucket=bucket, Key=key)
df = pd.read_csv(obj['Body'], index_col="Date", parse_dates=True)

clipped_df = df["1962-01-01":"1968-01-01"]
trace1 = go.Scatter(
    x=clipped_df.index,
    y=clipped_df["Production"].values,
    marker = dict(
        size = 6,
        color = 'green',
    ),
)

data = [trace1]
layout = go.Layout(
    showlegend=False,
    width=800,
    height=400,
    title="Seasonality and Upward Trend",
    xaxis=dict(title="Date")
)

fig = go.Figure(data=data, layout=layout)

# plotly.offline.iplot(fig)
html_fig = plotly.io.to_html(fig, include_plotlyjs=True)
display(HTML(html_fig))

We can clearly see in the plot above that there is a seasonal pattern associated with the data: at around the 3rd month of each year we observe a ramp up to a peak, and this pattern repeats every cycle. On top of the seasonality, overall milk production is trending upward.

Finally, we also have cyclical components. Cyclical components are trends that have no set repetition. In the plot below there do appear to be trends, however, they do not occur on a regular cycle.

In [207]:
bucket = "intuitiveml-data-sets"
key = "starbucks.csv"

obj = s3.get_object(Bucket=bucket, Key=key)
df = pd.read_csv(obj['Body'], index_col="Date", parse_dates=True)

trace1 = go.Scatter(
    x=df.index,
    y=df["Close"].values,
    marker = dict(
        size = 6,
        color = 'green',
    ),
)

data = [trace1]
layout = go.Layout(
    showlegend=False,
    width=800,
    height=400,
    title="Cyclical and Upward Trend",
    xaxis=dict(title="Date")
)

fig = go.Figure(data=data, layout=layout)

# plotly.offline.iplot(fig)
html_fig = plotly.io.to_html(fig, include_plotlyjs=True)
display(HTML(html_fig))

© 2018 Nathaniel Dake